[AI Notes] From Papers to PyTorch Practice in 30 Days: Fully Evaluating Your Model's Performance, Day 19

16th Ironman Contest

fan84sunny

2024-08-22 00:03:52

Image Generation

Evaluation Metrics

Suppose the images have already been generated and saved in the following folder:

/home/user/experiments/experiments_231220/train_mask_val_model_ad_590000/generate_result

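CLIP Score and FID (Fréchet Inception Distance) are the metrics most commonly used to evaluate T2I (text-to-image) models. This time we need to check which metrics the original paper actually used for scoring.

As covered in our earlier post, [AI Notes] From Papers to PyTorch Practice in 30 Days: Model Evaluation Metrics and Applications, Day 9, the evaluation mainly relies on CLIP Score and FID. CLIP Score uses the CLIP model to measure how well a generated image agrees with its text description, while FID measures the distance between the distribution of generated images and the distribution of real images.

However, the original code does not provide an evaluation script, so we have to go back to the paper for the implementation details. According to this reply, the researchers randomly pick one caption per image to generate a single picture, and then evaluate on those generated images.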

Paper

https://arxiv.org/pdf/2302.08453
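Section 4.2 (Comparison) of the original paper states that each method is evaluated on the COCO validation set, which contains 5,000 images. For each image, every method runs inference only once, with random sampling, and that single output is taken as the final result. Two metrics are used: FID (Fréchet Inception Distance) and CLIP Score (with the ViT-L/14 model). This design is meant to mimic real-world use and to avoid the bias that multiple inference runs would introduce.

However, every image in the COCO dataset has at least five annotated captions, which raises a question: which caption should be used at inference time? After going through all the GitHub issues, we found the answer in this reply. The researchers randomly pick one caption to generate one image, so 5,000 images should be generated in total.

Random selection like this helps keep the evaluation fair and diverse, and avoids the bias that always picking one fixed caption would introduce. It also mirrors real-world use, where a generative model usually has to handle varied inputs.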
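The random selection described above can be sketched as follows. This is a minimal illustration, not the authors' script: the helper name `pick_one_caption_per_image`, the fixed seed, and the toy annotation list are assumptions made for this example. The `image_id` and `caption` fields and the 12-digit file name follow the COCO caption layout used by the dataset class later in this post.

```python
import random
from collections import defaultdict


def pick_one_caption_per_image(annotations, seed=0):
    """Return one randomly chosen caption per image (hypothetical helper)."""
    # Group all captions that belong to the same image.
    captions_by_image = defaultdict(list)
    for ann in annotations:
        captions_by_image[ann["image_id"]].append(ann["caption"])

    rng = random.Random(seed)  # fixed seed so the choice is reproducible
    files = []
    for image_id in sorted(captions_by_image):
        files.append({
            "name": "%012d.jpg" % image_id,
            "sentence": rng.choice(captions_by_image[image_id]),
        })
    return files


# Toy annotations standing in for data["annotations"] from the caption JSON.
toy_annotations = [
    {"image_id": 139, "caption": "A woman stands in a dining area."},
    {"image_id": 139, "caption": "A room with chairs and a television."},
    {"image_id": 285, "caption": "A big brown bear sitting in the grass."},
]
files = pick_one_caption_per_image(toy_annotations)
print(files)
```

Each image ends up with exactly one record, so running this over the 5,000 validation images yields 5,000 prompts.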

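Loading the Images

First, work out exactly which data you need in order to compute the metrics correctly.
Here is what you need to prepare:

Original image: the base image that the generated image is derived from.
Extra condition: conditions that guide the generation, such as a mask or labels.
Caption: the text description of each original image, used to evaluate how well the generated image matches the description.
Generated image: the image generated from the original image and the extra condition.

I like to keep the original image's file name and prefix the generated image with gen_.
If the original image is named image_001.jpg, the generated image is named gen_image_001.jpg. This makes the two easy to match up and manage.

Here filename = '/home/user/dict_name_sentence.pkl' records the caption that was used for each generated image. This makes it easy to trace which caption each generated image came from and to pair them up at evaluation time. The CLIP Score computation later needs this caption information.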
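As a rough sketch of how such a mapping file could be written and read back, the snippet below saves a list of name/caption records with `pickle` and derives the generated file names from it. The record contents and the temporary directory are placeholders for illustration; only the `name`/`sentence` keys and the `gen_` prefix come from this post.

```python
import os
import pickle
import tempfile

# One record per generated image: original file name plus the caption used.
files = [{"name": "image_001.jpg", "sentence": "a hypothetical caption"}]

# A temporary directory stands in for the real experiment folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    dict_file = os.path.join(tmp_dir, "dict_name_sentence.pkl")

    # Save the mapping right after generation ...
    with open(dict_file, "wb") as f:
        pickle.dump(files, f)

    # ... and load it again at evaluation time.
    with open(dict_file, "rb") as f:
        loaded = pickle.load(f)

# The generated image name is the original name with the gen_ prefix.
gen_names = ["gen_" + entry["name"] for entry in loaded]
print(gen_names)  # ['gen_image_001.jpg']
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.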

import json
import os
import pickle

import cv2
from torch.utils.data import Dataset


class dataset_coco_gener_image(Dataset):
    """Pairs each COCO image with its mask, generated image and caption."""

    def __init__(self, path_json, root_path_im, root_path_mask, root_path_gener,
                 filename='/home/user/dict_name_sentence.pkl'):
        super().__init__()
        self.root_path_im = root_path_im
        self.root_path_mask = root_path_mask
        self.root_path_gener = root_path_gener
        self.filename = filename

        # The COCO caption annotations are only needed to rebuild the list
        # from scratch (see the commented-out loop below).
        with open(path_json, 'r', encoding='utf-8') as fp:
            data = json.load(fp)['annotations']

        # Load the saved list of {'name': ..., 'sentence': ...} records, i.e.
        # the caption that was actually used for each generated image.
        with open(filename, 'rb') as f_read:
            self.files = pickle.load(f_read)
        # for file in data:
        #     name = "%012d.jpg" % file['image_id']
        #     self.files.append({'name': name, 'sentence': file['caption']})

    def __getitem__(self, idx):
        file = self.files[idx]
        name = file['name']

        # Paths of the original and generated images (opened later by the caller).
        im = os.path.join(self.root_path_im, name)
        gen_im = os.path.join(self.root_path_gener, "gen_" + name)

        # The mask is read here as a BGR array; cv2.imread returns None if the
        # file is missing, which the default DataLoader collate cannot batch.
        mask = cv2.imread(os.path.join(self.root_path_mask, name))

        sentence = file['sentence']
        return {'im': im, 'mask': mask, 'gen_im': gen_im, 'sentence': sentence, 'name': name}

    def __len__(self):
        return len(self.files)

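Computing FID

FID is computed from the original images and the generated images.

Put the real data (real_images_folder) and the generated data (generated_images_folder) in two separate folders, and the function computes the score from them.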
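FID compares two sets of images rather than individual pairs, but if some generated images are missing, the two folders no longer describe the same evaluation set. A quick sanity check before scoring might look like the sketch below. The helper `find_missing_generated` and the temporary folders are hypothetical; it assumes the `gen_` naming convention used in this post.

```python
import os
import tempfile


def find_missing_generated(real_dir, gen_dir, prefix="gen_"):
    """List real image names that have no `prefix + name` file in gen_dir."""
    gen_names = set(os.listdir(gen_dir))
    return sorted(
        name for name in os.listdir(real_dir)
        if prefix + name not in gen_names
    )


# Demo with two temporary folders holding empty placeholder files.
with tempfile.TemporaryDirectory() as real_dir, tempfile.TemporaryDirectory() as gen_dir:
    for name in ("image_001.jpg", "image_002.jpg"):
        open(os.path.join(real_dir, name), "wb").close()
    open(os.path.join(gen_dir, "gen_image_001.jpg"), "wb").close()

    missing = find_missing_generated(real_dir, gen_dir)

print(missing)  # ['image_002.jpg']
```

An empty list means every real image has a generated counterpart, so both folders hold the same number of samples.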

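dims: the dimensionality of the features returned by Inception.
An Inception model extracts features from both sets of images, a distribution is fitted to each set of features, and the distance between the real-data distribution and the generated-image distribution is computed.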
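For reference, the distance described above is usually written as follows, where $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of the Inception features of the real and generated images. The subscripts are notation chosen for this note, not names taken from the original code.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower is better, and two identical feature distributions give 0. With dims=2048, each mean is a 2048-dimensional vector and each covariance is a 2048 × 2048 matrix.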

from pytorch_fid import fid_score


def fid(model, real_images_folder, generated_images_folder, device):
    # `model` is unused and kept only so existing callers do not break;
    # pytorch_fid builds its own Inception network internally.
    # Positional arguments after the paths: batch_size=1, then the device.
    fid_value = fid_score.calculate_fid_given_paths(
        [real_images_folder, generated_images_folder], 1, device, dims=2048)
    return fid_value

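Computing CLIP Score

CLIP Score is computed from the generated images and their corresponding captions.

https://github.com/openai/CLIP
Below, the CLIP model is used to compute a similarity score between each generated image and its caption. The model used here is ViT-B/32, not the ViT-L/14 reported in the paper, so the resulting numbers are not directly comparable with the paper's.

Back then there were not many APIs to choose from, so I had to write this myself. There is now a simple one you can call directly:
https://lightning.ai/docs/torchmetrics/stable/multimodal/clip_score.html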
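One thing to watch when switching to a ready-made metric: the torchmetrics documentation defines CLIPScore as max(100 · cos, 0), while the hand-written loop in this post logs the raw cosine similarity. The plain-Python sketch below shows the relationship on toy vectors. The function names and the feature values are made up for illustration, and real CLIP features have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score_from_cosine(cos_sim, scale=100.0):
    """Rescale a raw cosine similarity to the max(100 * cos, 0) convention."""
    return max(scale * cos_sim, 0.0)


# Toy stand-ins for an image embedding and a text embedding.
image_features = [0.6, 0.8, 0.0]
text_features = [0.8, 0.6, 0.0]

cos_sim = cosine_similarity(image_features, text_features)
print(cos_sim, clip_score_from_cosine(cos_sim))
```

Under that convention, a mean raw similarity of 0.3192 corresponds to a score of roughly 31.92, so check which scale a reported number uses before comparing results.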

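Preparing the Dataset and DataLoader

This code creates a validation dataset and data loader that read the original image path, the mask, the generated image path, and the corresponding caption from the given locations.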

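import clip
import torch

# path_json_val, real_images_folder, generated_images_folder, dict_file and
# device are assumed to be defined earlier in the evaluation script.

# CLIP score
model, preprocess = clip.load("ViT-B/32", device=device)

val_dataset = dataset_coco_gener_image(
    path_json_val,
    root_path_im=real_images_folder,
    root_path_mask='/home/user/datasets/coco/mask/val2017_color',
    root_path_gener=generated_images_folder,
    filename=dict_file,
)
val_dataloader = torch.utils.data.DataLoader(
    val_dataset,
    batch_size=1,   # the scoring loop opens one generated image per step
    shuffle=False,
    num_workers=1,
    pin_memory=False)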

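Computing the CLIP Similarity Score

This code iterates over the validation data loader and computes the CLIP similarity score for each generated image and its caption. The steps are:

Run the generated image through the preprocessing function, which converts it to a tensor.
Tokenize the caption into the format the CLIP model accepts.
Encode the image and the text separately with the CLIP model to obtain image features and text features.
Compute the cosine similarity between the image features and the text features, and store the result in the clip_sim_score_ list.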

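import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from tqdm import tqdm

# logger, model, preprocess, device, dict_file and val_dataloader are
# assumed to be defined earlier in the evaluation script.
logger.info('Starting to calculate CLIP!')
logger.info(f'Using dict name and image:{dict_file}')
clip_sim_score_ = []
# Wrap the dataloader itself so the progress bar tracks every image.
for data in tqdm(val_dataloader):
    with torch.no_grad():
        # batch_size=1, so each field is a one-element batch.
        image = preprocess(Image.open(data['gen_im'][0])).unsqueeze(0).to(device)
        text = clip.tokenize(data['sentence']).to(device)
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)

        # Raw cosine similarity in [-1, 1]; no x100 scaling is applied.
        sim = F.cosine_similarity(image_features, text_features)
        clip_sim_score_.append(sim.cpu().numpy().astype(np.float32))

clip_sim_score_np = np.array(clip_sim_score_)
logger.info('Mean sim_value: {:.4f}'.format(np.mean(clip_sim_score_np)))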

Output

2023-12-06 21:55:29,283 INFO: Evaluation on: /home/user/experiments/train_mask_val_model_ad_590000/generate_result
2023-12-06 21:55:29,283 INFO: Starting to calculate FID!
2023-12-06 21:57:52,989 INFO: fid_value: 19.53518798600112
2023-12-06 21:57:55,958 INFO: Starting to calculate CLIP!
2023-12-06 22:01:05,924 INFO: fid_value: 19.5352
2023-12-06 22:01:05,925 INFO: Mean sim_value: 0.3192
2023-12-06 22:01:05,925 INFO: DONE!